Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.
The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent."
WeRateDogs has over 4 million followers and has received international media coverage. WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.
import numpy as np
from numpy import nan
import pandas as pd
import requests
import os
import json
import tweepy
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import glob
!pip install --upgrade tweepy==4.4
Three datasets will be used:
1- twitter_df : data loaded from twitter-archive-enhanced.csv
2- images_df : data loaded from image_predictions.tsv
3- json_df : Twitter API data loaded from tweet-json.txt
twitter_df = pd.read_csv('twitter-archive-enhanced.csv')
twitter_df.tail()
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
| 2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
| 2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
| 2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
| 2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)
images_df = pd.read_csv('image_predictions.tsv', sep='\t', encoding = 'utf-8')
images_df.tail()
| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
| 2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
| 2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
| 2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
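The download cell writes whatever the server returns, so a failed request would be saved as if it were the TSV file. A minimal sketch of checking the status before writing is given here. The helper name `download_file`, the `get` parameter, and the `FakeResponse` stand-in are assumptions made so the example runs offline; with the real library, `requests.get` would be passed as `get`.

```python
import os
import tempfile


def download_file(url, path, get):
    """Fetch url with the given `get` callable and write the body to path.

    `get` is expected to behave like requests.get: the object it returns
    needs a raise_for_status() method and a bytes attribute `content`.
    """
    response = get(url)
    response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    with open(path, mode='wb') as file:
        file.write(response.content)
    return os.path.getsize(path)


class FakeResponse:
    """Stand-in for a requests response, used only to run the helper offline."""
    content = b'tweet_id\tjpg_url\n1\thttps://example.com/a.jpg\n'

    def raise_for_status(self):
        pass  # a real response raises an HTTPError here on failure


# Hypothetical URL and a temporary path, for illustration only.
tmp_path = os.path.join(tempfile.mkdtemp(), 'image_predictions.tsv')
n_bytes = download_file('https://example.com/image-predictions.tsv',
                        tmp_path, lambda url: FakeResponse())
```

In the notebook itself the call would look like `download_file(url, 'image_predictions.tsv', requests.get)`.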
auth = tweepy.OAuthHandler('X', 'X')
auth.set_access_token('X', 'X')
api = tweepy.API(auth, parser = tweepy.parsers.JSONParser(), wait_on_rate_limit = True)
json_df = pd.read_json('tweet-json.txt', lines = True, encoding='utf-8')
json_df.head()
| | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status_id_str | quoted_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 16:23:56+00:00 | 892420643555336193 | 892420643555336192 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 39467 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 1 | 2017-08-01 00:17:27+00:00 | 892177421306343426 | 892177421306343424 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 33819 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2 | 2017-07-31 00:18:03+00:00 | 891815181378084864 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 25461 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 3 | 2017-07-30 15:58:51+00:00 | 891689557279858688 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 42908 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 4 | 2017-07-29 16:00:24+00:00 | 891327558926688256 | 891327558926688256 | This is Franklin. He would like you to stop ca... | False | [0, 138] | {'hashtags': [{'text': 'BarkWeek', 'indices': ... | {'media': [{'id': 891327551943041024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 41048 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
5 rows × 31 columns
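In the table above, id_str differs from id in its last digits for some rows (for example 892420643555336192 against 892420643555336193), which suggests the string IDs went through a lossy numeric conversion on load. A minimal sketch of reading the same kind of line-delimited JSON with the standard json module, keeping only the fields needed later, could look like this. The helper name `load_tweet_counts` and the sample counts are made up for illustration.

```python
import io
import json

import pandas as pd


def load_tweet_counts(lines):
    """Build a DataFrame of tweet_id, retweet_count and favorite_count from JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)  # Python ints keep every digit of the ID
        records.append({
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return pd.DataFrame(records, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Two toy lines standing in for tweet-json.txt; the counts are invented.
sample = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 1, "favorite_count": 2}\n'
    '{"id": 892177421306343426, "retweet_count": 3, "favorite_count": 4}\n'
)
counts_df = load_tweet_counts(sample)
```

With the real file, the helper could be given an open file handle for tweet-json.txt instead of the StringIO object.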
To meet specifications, the following issues must be assessed.
You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
#Visual Assessment
twitter_df
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | 5 | 10 | None | None | None | None | None |
| 2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | 6 | 10 | a | None | None | None | None |
| 2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54 +0000 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | 9 | 10 | a | None | None | None | None |
| 2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | 7 | 10 | a | None | None | None | None |
| 2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08 +0000 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | 8 | 10 | None | None | None | None | None |
2356 rows × 17 columns
#Programmatic Assessment
twitter_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2356 non-null | int64 |
| 1 | in_reply_to_status_id | 78 non-null | float64 |
| 2 | in_reply_to_user_id | 78 non-null | float64 |
| 3 | timestamp | 2356 non-null | object |
| 4 | source | 2356 non-null | object |
| 5 | text | 2356 non-null | object |
| 6 | retweeted_status_id | 181 non-null | float64 |
| 7 | retweeted_status_user_id | 181 non-null | float64 |
| 8 | retweeted_status_timestamp | 181 non-null | object |
| 9 | expanded_urls | 2297 non-null | object |
| 10 | rating_numerator | 2356 non-null | int64 |
| 11 | rating_denominator | 2356 non-null | int64 |
| 12 | name | 2356 non-null | object |
| 13 | doggo | 2356 non-null | object |
| 14 | floofer | 2356 non-null | object |
| 15 | pupper | 2356 non-null | object |
| 16 | puppo | 2356 non-null | object |
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
We have 2,356 entries and 17 columns. The total memory usage of the dataframe is 313.0+ KB.
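Beyond info(), a few targeted checks can surface quality issues in the archive. The sketch below runs them on a small made-up frame; the column names match twitter_df, but the values are toy data, not taken from the archive.

```python
import pandas as pd

# Toy stand-in for twitter_df; values are invented for illustration.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 3],
    'rating_denominator': [10, 10, 50, 50],
    'name': ['Tilly', 'a', 'None', 'None'],
})

# How often does each denominator occur?
denominator_counts = toy['rating_denominator'].value_counts()

# Are any tweet IDs repeated?
n_duplicate_ids = int(toy['tweet_id'].duplicated().sum())

# Which "names" are all lowercase and so probably not names?
lowercase_names = sorted(toy.loc[toy['name'].str.islower(), 'name'].unique())
```

The same three expressions could be applied to twitter_df directly.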
#Visual Assessment
images_df
| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2070 | 891327558926688256 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | 2 | basset | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True |
| 2071 | 891689557279858688 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | 1 | paper_towel | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False |
| 2072 | 891815181378084864 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | 1 | Chihuahua | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True |
| 2073 | 892177421306343426 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | 1 | Chihuahua | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True |
| 2074 | 892420643555336193 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | 1 | orange | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False |
2075 rows × 12 columns
#Programmatic Assessment
images_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2075 non-null | int64 |
| 1 | jpg_url | 2075 non-null | object |
| 2 | img_num | 2075 non-null | int64 |
| 3 | p1 | 2075 non-null | object |
| 4 | p1_conf | 2075 non-null | float64 |
| 5 | p1_dog | 2075 non-null | bool |
| 6 | p2 | 2075 non-null | object |
| 7 | p2_conf | 2075 non-null | float64 |
| 8 | p2_dog | 2075 non-null | bool |
| 9 | p3 | 2075 non-null | object |
| 10 | p3_conf | 2075 non-null | float64 |
| 11 | p3_dog | 2075 non-null | bool |
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
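Some rows in the predictions table have no dog among the three guesses (the last row shown above is orange, bagel, banana). A minimal sketch of flagging such rows, and of checking that a confidence column stays within 0 and 1, is shown on a toy frame; the values are rounded or invented for illustration.

```python
import pandas as pd

# Toy stand-in for images_df with only the columns the checks need.
toy_images = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1': ['Chihuahua', 'paper_towel', 'orange'],
    'p1_conf': [0.72, 0.17, 0.10],
    'p1_dog': [True, False, False],
    'p2_dog': [True, True, False],
    'p3_dog': [True, False, False],
})

dog_flags = ['p1_dog', 'p2_dog', 'p3_dog']

# True where none of the three predictions is a dog breed.
no_dog_predicted = ~toy_images[dog_flags].any(axis=1)

# True where the confidence falls outside the closed interval [0, 1].
conf_out_of_range = ~toy_images['p1_conf'].between(0, 1)
```

On the full table, `no_dog_predicted.sum()` would give the number of images with no dog prediction at all.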
#Visual Assessment
json_df
| | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang | retweeted_status | quoted_status_id | quoted_status_id_str | quoted_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 16:23:56+00:00 | 892420643555336193 | 892420643555336192 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892420639486877696, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 39467 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 1 | 2017-08-01 00:17:27+00:00 | 892177421306343426 | 892177421306343424 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 892177413194625024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 33819 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2 | 2017-07-31 00:18:03+00:00 | 891815181378084864 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891815175371796480, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 25461 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 3 | 2017-07-30 15:58:51+00:00 | 891689557279858688 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 891689552724799489, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 42908 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 4 | 2017-07-29 16:00:24+00:00 | 891327558926688256 | 891327558926688256 | This is Franklin. He would like you to stop ca... | False | [0, 138] | {'hashtags': [{'text': 'BarkWeek', 'indices': ... | {'media': [{'id': 891327551943041024, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 41048 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2349 | 2015-11-16 00:24:50+00:00 | 666049248165822465 | 666049248165822464 | Here we have a 1949 1st generation vulpix. Enj... | False | [0, 120] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 666049244999131136, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 111 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2350 | 2015-11-16 00:04:52+00:00 | 666044226329800704 | 666044226329800704 | This is a purebred Piers Morgan. Loves to Netf... | False | [0, 137] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 666044217047650304, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 311 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2351 | 2015-11-15 23:21:54+00:00 | 666033412701032449 | 666033412701032448 | Here is a very happy pup. Big fan of well-main... | False | [0, 130] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 666033409081393153, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 128 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2352 | 2015-11-15 23:05:30+00:00 | 666029285002620928 | 666029285002620928 | This is a western brown Mitsubishi terrier. Up... | False | [0, 139] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 666029276303482880, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 132 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
| 2353 | 2015-11-15 22:32:08+00:00 | 666020888022790149 | 666020888022790144 | Here we have a Japanese Irish Setter. Lost eye... | False | [0, 131] | {'hashtags': [], 'symbols': [], 'user_mentions... | {'media': [{'id': 666020881337073664, 'id_str'... | <a href="http://twitter.com/download/iphone" r... | NaN | ... | 2535 | False | False | 0.0 | 0.0 | en | NaN | NaN | NaN | NaN |
2354 rows × 31 columns
#Programmatic Assessment
json_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | created_at | 2354 non-null | datetime64[ns, UTC] |
| 1 | id | 2354 non-null | int64 |
| 2 | id_str | 2354 non-null | int64 |
| 3 | full_text | 2354 non-null | object |
| 4 | truncated | 2354 non-null | bool |
| 5 | display_text_range | 2354 non-null | object |
| 6 | entities | 2354 non-null | object |
| 7 | extended_entities | 2073 non-null | object |
| 8 | source | 2354 non-null | object |
| 9 | in_reply_to_status_id | 78 non-null | float64 |
| 10 | in_reply_to_status_id_str | 78 non-null | float64 |
| 11 | in_reply_to_user_id | 78 non-null | float64 |
| 12 | in_reply_to_user_id_str | 78 non-null | float64 |
| 13 | in_reply_to_screen_name | 78 non-null | object |
| 14 | user | 2354 non-null | object |
| 15 | geo | 0 non-null | float64 |
| 16 | coordinates | 0 non-null | float64 |
| 17 | place | 1 non-null | object |
| 18 | contributors | 0 non-null | float64 |
| 19 | is_quote_status | 2354 non-null | bool |
| 20 | retweet_count | 2354 non-null | int64 |
| 21 | favorite_count | 2354 non-null | int64 |
| 22 | favorited | 2354 non-null | bool |
| 23 | retweeted | 2354 non-null | bool |
| 24 | possibly_sensitive | 2211 non-null | float64 |
| 25 | possibly_sensitive_appealable | 2211 non-null | float64 |
| 26 | lang | 2354 non-null | object |
| 27 | retweeted_status | 179 non-null | object |
| 28 | quoted_status_id | 29 non-null | float64 |
| 29 | quoted_status_id_str | 29 non-null | float64 |
| 30 | quoted_status | 28 non-null | object |
dtypes: bool(4), datetime64[ns, UTC](1), float64(11), int64(4), object(11)
memory usage: 505.9+ KB
twitter_df_clean = twitter_df.copy()
images_df_clean = images_df.copy()
json_df_clean = json_df.copy()
twitter_df_clean.retweeted_status_user_id.count()
181
json_df_clean['retweeted_status'].count()
179
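The two sources disagree slightly: 2,356 rows and 181 retweets in the archive against 2,354 rows and 179 retweets in the JSON data. One way to see which tweet IDs are present in one frame but not the other is a left merge with an indicator column. The sketch below uses toy IDs, and the frame names `archive` and `api` are placeholders.

```python
import pandas as pd

# Toy frames: ID 3 exists in the archive but not in the API data.
archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
api = pd.DataFrame({'id': [1, 2, 4]})

# indicator=True adds a _merge column saying where each row was found.
compared = archive.merge(api, how='left', left_on='tweet_id',
                         right_on='id', indicator=True)

missing_from_api = compared.loc[compared['_merge'] == 'left_only', 'tweet_id'].tolist()
```

Applied to twitter_df_clean and json_df_clean, the resulting list would name the tweets that have no API record.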
Drop the columns that will not be used from all datasets.
Drop the rows of twitter_df_clean whose retweeted_status fields are not null (retweets).
twitter_df:
Replace words in the name column that are not names (i.e. all words starting with a lowercase letter) with NaN.
Replace None values with NaN in the "doggo", "floofer", "pupper" and "puppo" columns.
Split the timestamp column so that only a date column remains.
Correct data types.
Remove retweets (rows with non-null "retweeted_status" fields) and then the retweet columns, keeping only tweets sent by the account.
Correct the 'rating_denominator' column (max 10).
images_df:
json_df:
Collect all dog types in one column and delete "doggo", "floofer", "pupper" and "puppo" columns.
Merge all datasets.
type_list = ['doggo','pupper', 'floofer', 'puppo' ]
for i in type_list:
    twitter_df_clean[i] = twitter_df_clean[i].replace('None', '')
twitter_df_clean['dog_type'] = twitter_df_clean.doggo.str.cat(twitter_df_clean.floofer).str.cat(twitter_df_clean.pupper).str.cat(twitter_df_clean.puppo)
twitter_df_clean = twitter_df_clean.drop(['doggo','floofer','pupper','puppo'], axis = 1)
twitter_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2356 non-null | int64 |
| 1 | in_reply_to_status_id | 78 non-null | float64 |
| 2 | in_reply_to_user_id | 78 non-null | float64 |
| 3 | timestamp | 2356 non-null | object |
| 4 | source | 2356 non-null | object |
| 5 | text | 2356 non-null | object |
| 6 | retweeted_status_id | 181 non-null | float64 |
| 7 | retweeted_status_user_id | 181 non-null | float64 |
| 8 | retweeted_status_timestamp | 181 non-null | object |
| 9 | expanded_urls | 2297 non-null | object |
| 10 | rating_numerator | 2356 non-null | int64 |
| 11 | rating_denominator | 2356 non-null | int64 |
| 12 | name | 2356 non-null | object |
| 13 | dog_type | 2356 non-null | object |
dtypes: float64(4), int64(3), object(7)
memory usage: 257.8+ KB
twitter_df_clean['dog_type'] = twitter_df_clean['dog_type'].replace('', np.nan)
twitter_df_clean.head(2)
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | dog_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | NaN |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | NaN |
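Because the str.cat chain joins the four columns without a separator, a tweet tagged with two stages would end up with a fused value such as doggopupper. A minimal sketch of joining the labels with a separator instead is shown on a toy frame; the helper name `join_stages` and the choice of ', ' as separator are assumptions.

```python
import pandas as pd

# Toy stand-in for the four stage columns, using 'None' as in the archive.
toy = pd.DataFrame({
    'doggo': ['None', 'doggo', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['pupper', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']


def join_stages(row):
    """Join the stage labels present in a row with ', '; return None if there are none."""
    stages = [value for value in row if value != 'None']
    return ', '.join(stages) if stages else None


toy['dog_type'] = toy[stage_columns].apply(join_stages, axis=1)
```

Rows with more than one stage then read as 'doggo, pupper', which is easier to split or count later.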
Replace entries in the "name" column that do not refer to names (i.e. words that start with a lowercase letter, such as "a") and the string "None" with NaN.
twitter_df_clean.replace(to_replace = twitter_df_clean.name.str.islower(), value = np.nan, inplace = True)
twitter_df_clean['name'] = twitter_df_clean['name'].replace('None', np.nan)
twitter_df_clean['name'] = twitter_df_clean['name'].replace('a', np.nan)
twitter_df_clean['name'].value_counts()
Charlie 12
Cooper 11
Lucy 11
Oliver 11
Tucker 10
..
Aqua 1
Chase 1
Meatball 1
Rorie 1
Christoper 1
Name: name, Length: 955, dtype: int64
twitter_df_clean['name'].isnull().sum()
800
twitter_df_clean['name'].tail(2)
2354 NaN
2355 NaN
Name: name, dtype: object
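As far as I can tell, a boolean Series passed as to_replace to DataFrame.replace is not used as a row mask, which may be why the separate replacements for 'None' and 'a' are needed above and why other lowercase words could remain. A minimal sketch of the mask-based alternative is shown on a toy column; the sample words are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the name column; 'the' and 'quite' are invented examples.
toy = pd.DataFrame({'name': ['Phineas', 'a', 'None', 'the', 'Tilly', 'quite']})

# True for all-lowercase words and for the placeholder string 'None'.
not_a_name = toy['name'].str.islower() | (toy['name'] == 'None')

# Assign NaN only to the rows selected by the mask.
toy.loc[not_a_name, 'name'] = np.nan
```

The same two lines could be applied to twitter_df_clean to cover every lowercase word in one step.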
twitter_df_clean["date"] = pd.to_datetime(twitter_df_clean['timestamp']).dt.date
twitter_df_clean.drop(['timestamp'], axis=1, inplace=True)
twitter_df_clean.head(2)
| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | dog_type | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | NaN | 2017-08-01 |
| 1 | 892177421306343426 | NaN | NaN | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | NaN | 2017-08-01 |
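Taking .dt.date produces Python date objects, so the new column has dtype object rather than a datetime dtype. If date arithmetic or resampling is wanted later, .dt.normalize() is one alternative that keeps a datetime dtype while zeroing the time of day. A minimal sketch on two toy timestamps (the column names date_as_object and date_as_datetime are placeholders):

```python
import pandas as pd

# Two timestamps in the same format as the archive.
toy = pd.DataFrame({'timestamp': ['2017-08-01 16:23:56 +0000',
                                  '2015-11-15 22:32:08 +0000']})

parsed = pd.to_datetime(toy['timestamp'])

# Python date objects: readable, but stored with dtype object.
toy['date_as_object'] = parsed.dt.date

# Midnight timestamps: the column keeps a datetime dtype.
toy['date_as_datetime'] = parsed.dt.normalize()
```

Which form is preferable depends on whether the column is only displayed or also used for time-based analysis.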
twitter_df_clean = twitter_df_clean.drop(twitter_df_clean[twitter_df_clean.rating_denominator > 10].index)
twitter_df_clean.rating_denominator.max()
10
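Dropping rows with a denominator above 10 brings the maximum to 10, but it would keep any rows whose denominator is below 10; whether such rows exist in the archive is not shown here. A minimal sketch of the difference between the two filters, on made-up ratings:

```python
import pandas as pd

# Toy ratings; the values are invented for illustration.
toy = pd.DataFrame({'rating_numerator': [13, 84, 1, 12],
                    'rating_denominator': [10, 70, 2, 10]})

# Same effect as dropping rows where the denominator is greater than 10.
dropped_above_ten = toy[toy['rating_denominator'] <= 10]

# Stricter alternative: keep only rows rated out of exactly 10.
kept_exactly_ten = toy[toy['rating_denominator'] == 10]
```

The stricter filter is an assumption about what "max 10" is meant to achieve, not something the project requires.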
twitter_df_clean.drop(twitter_df_clean[twitter_df_clean['retweeted_status_user_id'].notnull()].index, inplace=True)
twitter_df_clean['retweeted_status_user_id'].notnull().any()
False
twitter_df_clean = twitter_df_clean.drop(['source',
'in_reply_to_status_id',
'in_reply_to_user_id',
'retweeted_status_id',
'retweeted_status_user_id',
'retweeted_status_timestamp',
'expanded_urls'], axis = 1)
twitter_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2355
Data columns (total 7 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2156 non-null | int64 |
| 1 | text | 2156 non-null | object |
| 2 | rating_numerator | 2156 non-null | int64 |
| 3 | rating_denominator | 2156 non-null | int64 |
| 4 | name | 1437 non-null | object |
| 5 | dog_type | 344 non-null | object |
| 6 | date | 2156 non-null | object |
dtypes: int64(3), object(4)
memory usage: 134.8+ KB
json_df_clean = json_df_clean.drop(['created_at',
'source',
'full_text',
'in_reply_to_status_id',
'in_reply_to_user_id',
'in_reply_to_status_id_str',
'in_reply_to_user_id_str',
'in_reply_to_screen_name',
'truncated',
'display_text_range',
'entities',
'extended_entities',
'user',
'geo',
'coordinates',
'place',
'contributors',
'is_quote_status',
'favorited',
'retweeted',
'possibly_sensitive',
'possibly_sensitive_appealable',
'quoted_status_id',
'quoted_status_id_str',
'lang',
'quoted_status'], axis = 1)
json_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 5 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 2354 non-null | int64 |
| 1 | id_str | 2354 non-null | int64 |
| 2 | retweet_count | 2354 non-null | int64 |
| 3 | favorite_count | 2354 non-null | int64 |
| 4 | retweeted_status | 179 non-null | object |
dtypes: int64(4), object(1)
memory usage: 92.1+ KB
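When only a few columns are needed, selecting them can be shorter and less error-prone than listing everything to drop. A minimal sketch on a toy frame follows; which columns to keep is an assumption here, since the cell above also retains id_str and retweeted_status.

```python
import pandas as pd

# Toy stand-in for json_df with a handful of its columns.
toy_json = pd.DataFrame({
    'id': [1, 2],
    'id_str': [1, 2],
    'lang': ['en', 'en'],
    'retweet_count': [5, 7],
    'favorite_count': [50, 70],
    'retweeted_status': [None, None],
})

# Name the columns to keep instead of the columns to drop.
wanted = ['id', 'retweet_count', 'favorite_count']
toy_json_clean = toy_json[wanted].copy()
```

The .copy() call makes the result an independent frame, so later edits do not touch the original.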
images_df_clean = images_df_clean.drop(['img_num'], axis = 1)
images_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 11 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2075 non-null | int64 |
| 1 | jpg_url | 2075 non-null | object |
| 2 | p1 | 2075 non-null | object |
| 3 | p1_conf | 2075 non-null | float64 |
| 4 | p1_dog | 2075 non-null | bool |
| 5 | p2 | 2075 non-null | object |
| 6 | p2_conf | 2075 non-null | float64 |
| 7 | p2_dog | 2075 non-null | bool |
| 8 | p3 | 2075 non-null | object |
| 9 | p3_conf | 2075 non-null | float64 |
| 10 | p3_dog | 2075 non-null | bool |
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 135.9+ KB
images_df_clean = images_df_clean.rename(columns={'jpg_url': 'image_url',
'p1': 'first_prediction',
'p1_conf': 'first_confidence',
'p1_dog': 'first_dog_prediction',
'p2': 'second_prediction',
'p2_conf': 'second_confidence',
'p2_dog': 'second_dog_prediction',
'p3': 'third_prediction',
'p3_conf': 'third_confidence',
'p3_dog': 'third_dog_prediction'})
images_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 11 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2075 non-null | int64 |
| 1 | image_url | 2075 non-null | object |
| 2 | first_prediction | 2075 non-null | object |
| 3 | first_confidence | 2075 non-null | float64 |
| 4 | first_dog_prediction | 2075 non-null | bool |
| 5 | second_prediction | 2075 non-null | object |
| 6 | second_confidence | 2075 non-null | float64 |
| 7 | second_dog_prediction | 2075 non-null | bool |
| 8 | third_prediction | 2075 non-null | object |
| 9 | third_confidence | 2075 non-null | float64 |
| 10 | third_dog_prediction | 2075 non-null | bool |
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 135.9+ KB
twitter_df_clean = twitter_df_clean.rename(columns={'text': 'tweet'})
twitter_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2355
Data columns (total 7 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2156 non-null | int64 |
| 1 | tweet | 2156 non-null | object |
| 2 | rating_numerator | 2156 non-null | int64 |
| 3 | rating_denominator | 2156 non-null | int64 |
| 4 | name | 1437 non-null | object |
| 5 | dog_type | 344 non-null | object |
| 6 | date | 2156 non-null | object |
dtypes: int64(3), object(4)
memory usage: 134.8+ KB
json_df_clean = json_df_clean.rename(columns={'id': 'tweet_id'})
json_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 5 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | tweet_id | 2354 non-null | int64 |
| 1 | id_str | 2354 non-null | int64 |
| 2 | retweet_count | 2354 non-null | int64 |
| 3 | favorite_count | 2354 non-null | int64 |
| 4 | retweeted_status | 179 non-null | object |
dtypes: int64(4), object(1)
memory usage: 92.1+ KB
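With id renamed to tweet_id, all three cleaned frames share a key, which is what the planned merge needs. A minimal sketch of chaining two inner merges on toy frames is given here; the frame names and values are placeholders, and the choice of an inner join is an assumption based on the requirement to keep only ratings that have images.

```python
import pandas as pd

# Toy stand-ins for the three cleaned frames; tweet 2 has no image row.
toy_archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                            'rating_numerator': [13, 12, 11]})
toy_images = pd.DataFrame({'tweet_id': [1, 3],
                           'first_prediction': ['Chihuahua', 'basset']})
toy_counts = pd.DataFrame({'tweet_id': [1, 2, 3],
                           'retweet_count': [5, 7, 9],
                           'favorite_count': [50, 70, 90]})

# An inner join keeps only tweet IDs present in every frame.
merged = (toy_archive
          .merge(toy_images, on='tweet_id', how='inner')
          .merge(toy_counts, on='tweet_id', how='inner'))
```

Tweet 2 drops out of the result because it has no row in the image frame.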
duplicated = images_df_clean[images_df_clean.duplicated(['image_url'], keep = False)]
duplicated
| | tweet_id | image_url | first_prediction | first_confidence | first_dog_prediction | second_prediction | second_confidence | second_dog_prediction | third_prediction | third_confidence | third_dog_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 85 | 667509364010450944 | https://pbs.twimg.com/media/CUN4Or5UAAAa5K4.jpg | beagle | 0.636169 | True | Labrador_retriever | 0.119256 | True | golden_retriever | 0.082549 | True |
| 224 | 670319130621435904 | https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg | Irish_terrier | 0.254856 | True | briard | 0.227716 | True | soft-coated_wheaten_terrier | 0.223263 | True |
| 241 | 670444955656130560 | https://pbs.twimg.com/media/CU3mITUWIAAfyQS.jpg | English_springer | 0.403698 | True | Brittany_spaniel | 0.347609 | True | Welsh_springer_spaniel | 0.137186 | True |
| 327 | 671896809300709376 | https://pbs.twimg.com/media/CVMOlMiWwAA4Yxl.jpg | chow | 0.243529 | True | hamster | 0.227150 | False | Pomeranian | 0.056057 | True |
| 382 | 673320132811366400 | https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg | Samoyed | 0.978833 | True | Pomeranian | 0.012763 | True | Eskimo_dog | 0.001853 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1970 | 868880397819494401 | https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg | laptop | 0.153718 | False | French_bulldog | 0.099984 | True | printer | 0.077130 | False |
| 1992 | 873697596434513921 | https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg | laptop | 0.153718 | False | French_bulldog | 0.099984 | True | printer | 0.077130 | False |
| 2041 | 885311592912609280 | https://pbs.twimg.com/media/C4bTH6nWMAAX_bJ.jpg | Labrador_retriever | 0.908703 | True | seat_belt | 0.057091 | False | pug | 0.011933 | True |
| 2051 | 887473957103951883 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | Pembroke | 0.809197 | True | Rhodesian_ridgeback | 0.054950 | True | beagle | 0.038915 | True |
| 2055 | 888202515573088257 | https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg | Pembroke | 0.809197 | True | Rhodesian_ridgeback | 0.054950 | True | beagle | 0.038915 | True |
132 rows × 11 columns
images_df_clean = images_df_clean.drop_duplicates(subset=['image_url'], keep='first')
images_df_clean.image_url.duplicated().sum()
0
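The two calls above treat duplicates differently, which a small example makes visible. This is a minimal sketch on a made-up `images` table (the table and its values are hypothetical, not project data): `duplicated(..., keep=False)` flags every copy of a repeated URL, while `drop_duplicates(..., keep='first')` removes only the later copies. That would be consistent with 132 flagged rows shrinking the table by 66 rows (2075 to 2009) if each repeated URL appears exactly twice.

```python
import pandas as pd

# Hypothetical image table, made up for illustration
images = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "image_url": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
})

# keep=False flags every copy of a repeated URL, not just the later ones
flagged = images[images.duplicated(["image_url"], keep=False)]

# keep='first' retains the first occurrence and drops the rest
deduped = images.drop_duplicates(subset=["image_url"], keep="first")

print(len(flagged), len(deduped))
```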
twitter_df_clean["date"] = pd.to_datetime(twitter_df_clean["date"])
"You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used."
filtered_df = twitter_df_clean.loc[(twitter_df_clean['date'] > '2017-08-01')].any()
filtered_df
tweet_id              False
tweet                 False
rating_numerator      False
rating_denominator    False
name                  False
dog_type              False
date                  False
dtype: bool
There is no data after 2017-08-01.
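The same check can also be expressed as a single count, which is easier to read than one boolean per column. A minimal sketch, assuming a made-up `toy` frame in place of `twitter_df_clean` (the names `toy`, `cutoff`, and `n_after_cutoff` are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for twitter_df_clean, made up for illustration
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "date": pd.to_datetime(["2017-07-30", "2017-08-01", "2016-01-15"]),
})

cutoff = pd.Timestamp("2017-08-01")

# Boolean mask of rows dated after the cutoff; summing it counts them
n_after_cutoff = int((toy["date"] > cutoff).sum())

print(n_after_cutoff)  # a count of 0 means no tweet falls after the cutoff
```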
# tweet_id is the only column the two tables share, so it is the merge key
twitter_and_json = pd.merge(twitter_df_clean, json_df_clean, how='inner', on='tweet_id', sort=True)
twitter_and_json.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2155
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   tweet_id            2156 non-null   int64
 1   tweet               2156 non-null   object
 2   rating_numerator    2156 non-null   int64
 3   rating_denominator  2156 non-null   int64
 4   name                1437 non-null   object
 5   dog_type            344 non-null    object
 6   date                2156 non-null   datetime64[ns]
 7   id_str              2156 non-null   int64
 8   retweet_count       2156 non-null   int64
 9   favorite_count      2156 non-null   int64
 10  retweeted_status    0 non-null      object
dtypes: datetime64[ns](1), int64(6), object(4)
memory usage: 202.1+ KB
# tweet_id is again the only shared column, so it is the merge key
master_df = pd.merge(twitter_and_json, images_df_clean, how='inner', on='tweet_id', sort=True)
twitter_df_clean.info(), images_df_clean.info(), json_df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2355
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   tweet_id            2156 non-null   int64
 1   tweet               2156 non-null   object
 2   rating_numerator    2156 non-null   int64
 3   rating_denominator  2156 non-null   int64
 4   name                1437 non-null   object
 5   dog_type            344 non-null    object
 6   date                2156 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 134.8+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2009 entries, 0 to 2074
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               2009 non-null   int64
 1   image_url              2009 non-null   object
 2   first_prediction       2009 non-null   object
 3   first_confidence       2009 non-null   float64
 4   first_dog_prediction   2009 non-null   bool
 5   second_prediction      2009 non-null   object
 6   second_confidence      2009 non-null   float64
 7   second_dog_prediction  2009 non-null   bool
 8   third_prediction       2009 non-null   object
 9   third_confidence       2009 non-null   float64
 10  third_dog_prediction   2009 non-null   bool
dtypes: bool(3), float64(3), int64(1), object(4)
memory usage: 147.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   tweet_id          2354 non-null   int64
 1   id_str            2354 non-null   int64
 2   retweet_count     2354 non-null   int64
 3   favorite_count    2354 non-null   int64
 4   retweeted_status  179 non-null    object
dtypes: int64(4), object(1)
memory usage: 92.1+ KB
(None, None, None)
master_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1978 entries, 0 to 1977
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   tweet_id               1978 non-null   int64
 1   tweet                  1978 non-null   object
 2   rating_numerator       1978 non-null   int64
 3   rating_denominator     1978 non-null   int64
 4   name                   1390 non-null   object
 5   dog_type               306 non-null    object
 6   date                   1978 non-null   datetime64[ns]
 7   id_str                 1978 non-null   int64
 8   retweet_count          1978 non-null   int64
 9   favorite_count         1978 non-null   int64
 10  retweeted_status       0 non-null      object
 11  image_url              1978 non-null   object
 12  first_prediction       1978 non-null   object
 13  first_confidence       1978 non-null   float64
 14  first_dog_prediction   1978 non-null   bool
 15  second_prediction      1978 non-null   object
 16  second_confidence      1978 non-null   float64
 17  second_dog_prediction  1978 non-null   bool
 18  third_prediction       1978 non-null   object
 19  third_confidence       1978 non-null   float64
 20  third_dog_prediction   1978 non-null   bool
dtypes: bool(3), datetime64[ns](1), float64(3), int64(6), object(8)
memory usage: 299.4+ KB
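The inner merge shrinks the table from 2156 to 1978 rows because it keeps only the `tweet_id` values present on both sides. One way to see which rows an inner merge would lose is an outer merge with `indicator=True`. The sketch below uses two made-up miniature tables (`left`, `right`, and their values are hypothetical), and the `validate` argument is an optional extra safeguard rather than something the notebook uses.

```python
import pandas as pd

# Hypothetical miniature tables, made up for illustration
left = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 10]})
right = pd.DataFrame({"tweet_id": [2, 3, 4], "retweet_count": [50, 7, 99]})

# An outer merge with indicator=True adds a _merge column that records
# whether each row came from the left table, the right table, or both
audit = pd.merge(left, right, on="tweet_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())

# The inner merge keeps only ids present in both tables;
# validate raises MergeError if tweet_id is duplicated on either side
inner = pd.merge(left, right, on="tweet_id", how="inner", validate="one_to_one")
print(len(inner))
```

Rows labelled `left_only` or `right_only` in the audit are exactly the ones an inner merge drops.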
master_df['name']  # inspect the "name" column
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1973 Franklin
1974 Darla
1975 Archie
1976 Tilly
1977 Phineas
Name: name, Length: 1978, dtype: object
master_df.isin(['None']).any()  # check for any leftover 'None' placeholder strings (x == None is always False in pandas)
tweet_id                 False
tweet                    False
rating_numerator         False
rating_denominator       False
name                     False
dog_type                 False
date                     False
id_str                   False
retweet_count            False
favorite_count           False
retweeted_status         False
image_url                False
first_prediction         False
first_confidence         False
first_dog_prediction     False
second_prediction        False
second_confidence        False
second_dog_prediction    False
third_prediction         False
third_confidence         False
third_dog_prediction     False
dtype: bool
master_df.rating_denominator.max()  # check that no denominator is greater than 10
10
Columns that will not be used were deleted from all datasets.
Retweets (rows where "retweeted_status" is not null) were removed.
Values in the "name" column that are not real names were replaced with NaN.
"None" values for the dog stages were replaced with NaN.
The timestamp column was split, and only the date column remains.
All data types were corrected.
"rating_denominator" was limited to a maximum of 10.
Nondescriptive column names were renamed.
The dog stages were merged into a single column named "dog_type".
All datasets were merged.
# to_csv returns None when a path is given, so there is nothing to assign
master_df.to_csv('twitter_archive_master.csv', index=False, encoding='utf-8')
You must produce at least three (3) insights and one (1) visualization. You must clearly document the piece of assessed and cleaned (if necessary) data used to make each analysis and visualization.
descriptive_statistics= master_df.drop(['tweet_id', 'id_str'], axis=1)
descriptive_statistics.describe()
| | rating_numerator | rating_denominator | retweet_count | favorite_count | first_confidence | second_confidence | third_confidence |
|---|---|---|---|---|---|---|---|
| count | 1978.000000 | 1978.000000 | 1978.000000 | 1978.000000 | 1978.000000 | 1.978000e+03 | 1.978000e+03 |
| mean | 11.699191 | 9.994439 | 2767.346309 | 8915.103640 | 0.593920 | 1.346733e-01 | 6.015502e-02 |
| std | 40.832225 | 0.192077 | 4681.073745 | 12244.493841 | 0.272085 | 1.007869e-01 | 5.075772e-02 |
| min | 0.000000 | 2.000000 | 16.000000 | 81.000000 | 0.044333 | 1.011300e-08 | 1.740170e-10 |
| 25% | 10.000000 | 10.000000 | 622.250000 | 1956.250000 | 0.362656 | 5.407533e-02 | 1.606823e-02 |
| 50% | 11.000000 | 10.000000 | 1354.500000 | 4141.000000 | 0.587635 | 1.178485e-01 | 4.950530e-02 |
| 75% | 12.000000 | 10.000000 | 3223.000000 | 11326.500000 | 0.846285 | 1.955197e-01 | 9.159438e-02 |
| max | 1776.000000 | 10.000000 | 79515.000000 | 132810.000000 | 1.000000 | 4.880140e-01 | 2.710420e-01 |
The descriptive statistics indicate:
The average rating is 11.70 (mean numerator) out of 9.99 (mean denominator).
The average retweet count is 2767 with a maximum of 79515, while the average favorite count is 8915 with a maximum of 132810.
The mean confidence is about 59% for the first prediction and decreases to about 13% for the second and about 6% for the third, so the first prediction is by far the most reliable of the three.
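The scientific notation in the describe() output is easy to misread, so it helps to convert the mean confidences to percentages explicitly. A small sketch using the three mean values copied from the table above (the dictionary names are illustrative):

```python
# Mean confidences copied from the describe() output above
mean_confidence = {
    "first_confidence": 0.593920,
    "second_confidence": 1.346733e-01,
    "third_confidence": 6.015502e-02,
}

# e-01 means "times 10**-1", so 1.346733e-01 is 0.1346733, i.e. about 13.5%
as_percent = {name: round(value * 100, 1) for name, value in mean_confidence.items()}

print(as_percent)
# {'first_confidence': 59.4, 'second_confidence': 13.5, 'third_confidence': 6.0}
```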
Which dog is the outlier in terms of ratings?
outlier = master_df[master_df['rating_numerator'] == 1776]
outlier
| | tweet_id | tweet | rating_numerator | rating_denominator | name | dog_type | date | id_str | retweet_count | favorite_count | ... | image_url | first_prediction | first_confidence | first_dog_prediction | second_prediction | second_confidence | second_dog_prediction | third_prediction | third_confidence | third_dog_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1253 | 749981277374128128 | This is Atticus. He's quite simply America af.... | 1776 | 10 | Atticus | NaN | 2016-07-04 | 749981277374128128 | 2772 | 5569 | ... | https://pbs.twimg.com/media/CmgBZ7kWcAAlzFD.jpg | bow_tie | 0.533941 | False | sunglasses | 0.080822 | False | sunglass | 0.050776 | False |
1 rows × 21 columns
from PIL import Image
import requests
from io import BytesIO
url = master_df.image_url[1253]
response = requests.get(url)
outlier_dog_image = Image.open(BytesIO(response.content))
outlier_dog_image
Most common breed on WeRateDogs
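# Take labels and counts from the same value_counts() result so they stay aligned;
# head(5) on the raw column would return the first five rows, not the top five breeds
top_breeds = master_df['third_prediction'].value_counts().head(5)
x = list(top_breeds.index)
y = list(top_breeds.values)
plt.barh(x, y)
plt.title('Dogs comparison')
plt.ylabel('Dogs')
plt.show()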
The Shetland sheepdog is the most common breed among the third predictions.
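When plotting the most frequent categories, the labels and the counts need to come from the same `value_counts()` result, because its index holds the labels in the same order as the counts. A minimal sketch on a made-up `preds` series (the breeds and their frequencies here are hypothetical, not the project data):

```python
import pandas as pd

# Hypothetical predictions, made up for illustration
preds = pd.Series(["pug", "beagle", "pug", "chow", "pug", "beagle"])

counts = preds.value_counts().head(2)

# The index holds the category labels in the same order as the counts
labels = list(counts.index)
values = list(counts.values)

print(labels, values)
```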
Do the predictions match the images?
shetland = master_df.query('third_prediction == "Shetland_sheepdog"')
shetland.sort_values('third_confidence', ascending=False).head(2)
| | tweet_id | tweet | rating_numerator | rating_denominator | name | dog_type | date | id_str | retweet_count | favorite_count | ... | image_url | first_prediction | first_confidence | first_dog_prediction | second_prediction | second_confidence | second_dog_prediction | third_prediction | third_confidence | third_dog_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1546 | 799757965289017345 | This is Sobe. She's a h*ckin happy doggo. Only... | 13 | 10 | Sobe | doggo | 2016-11-18 | 799757965289017344 | 2506 | 9390 | ... | https://pbs.twimg.com/media/CxlPnoSUcAEXf1i.jpg | Border_collie | 0.442534 | True | collie | 0.288684 | True | Shetland_sheepdog | 0.196399 | True |
| 1649 | 819588359383371776 | This is Jazzy. She just found out that sandwic... | 13 | 10 | Jazzy | NaN | 2017-01-12 | 819588359383371776 | 2271 | 10606 | ... | https://pbs.twimg.com/media/C1_DQn3UoAIoJy7.jpg | Cardigan | 0.547935 | True | basenji | 0.116442 | True | Shetland_sheepdog | 0.101681 | True |
2 rows × 21 columns
url = master_df.image_url[1546]
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img
url = master_df.image_url[1649]
response = requests.get(url)
img2 = Image.open(BytesIO(response.content))
img2
Among the tweets whose third prediction is Shetland sheepdog, img has the highest third_confidence and img2 the second highest, and both dogs are indeed Shetland sheepdogs.
shetland.sort_values('first_confidence', ascending=False).head(1)
| | tweet_id | tweet | rating_numerator | rating_denominator | name | dog_type | date | id_str | retweet_count | favorite_count | ... | image_url | first_prediction | first_confidence | first_dog_prediction | second_prediction | second_confidence | second_dog_prediction | third_prediction | third_confidence | third_dog_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1785 | 844979544864018432 | PUPDATE: I'm proud to announce that Toby is 23... | 13 | 10 | NaN | NaN | 2017-03-23 | 844979544864018432 | 2909 | 14738 | ... | https://pbs.twimg.com/media/C7n4aQ0VAAAohkL.jpg | tennis_ball | 0.999281 | False | racket | 0.00037 | False | Shetland_sheepdog | 0.000132 | True |
1 rows × 21 columns
url = master_df.image_url[1785]
response = requests.get(url)
img3 = Image.open(BytesIO(response.content))
img3
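Among the tweets whose third prediction is Shetland sheepdog, img3 has the highest first_confidence (about 0.999, against an average of roughly 0.59). Note that this confidence belongs to the first prediction, tennis_ball, not to the Shetland sheepdog prediction, whose third_confidence is only 0.000132. The dog in the image is nevertheless a Shetland sheepdog.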
Most popular dog types on WeRateDogs
pupper_counts = master_df.dog_type.value_counts()['pupper']
non_empty_counts1 = master_df.dog_type.count()
pupper_counts/non_empty_counts1
0.6633986928104575
About 66% of the tweets that have a dog type are about puppers, i.e. puppies.
doggo_counts = master_df.dog_type.value_counts()['doggo']
doggo_counts/non_empty_counts1
0.20588235294117646
The second most common type, doggo, accounts for only about 21% of the tweets that have a dog type.
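The shares of all dog types can also be computed in one step with `value_counts(normalize=True)`, which divides each count by the number of non-null entries. A minimal sketch on a made-up `dog_type` series (the values are hypothetical and do not reproduce the project's proportions):

```python
import pandas as pd

# Hypothetical dog_type column, made up for illustration; None marks a missing label
dog_type = pd.Series(["pupper", "doggo", "pupper", None, "puppo", "pupper", None])

# normalize=True divides each count by the number of non-null entries,
# so the shares sum to 1 over the labeled rows only
shares = dog_type.value_counts(normalize=True)

print(shares.round(2))
```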
labels = np.full(len(master_df.dog_type.value_counts()), "", dtype=object)
labels[0] = 'pupper'
labels[1] = 'doggo'
labels[2] = 'puppo'
master_df.dog_type.value_counts().plot(kind="pie", labels=labels)
<AxesSubplot:ylabel='dog_type'>
The most popular dog name: Charlie
from collections import Counter
dog_name = master_df['name']
count = Counter(dog_name)
count.most_common(2)
[(nan, 588), ('Charlie', 11)]
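`Counter` counts NaN like any other value, which is why the missing names appear first in the result above. As an alternative sketch, pandas' `value_counts()` skips NaN by default, so the top entry is the most common real name. The `names` series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical name column, made up for illustration
names = pd.Series(["Charlie", np.nan, "Lucy", "Charlie", np.nan, np.nan])

# value_counts() skips NaN by default (dropna=True)
name_counts = names.value_counts()
top_name = name_counts.idxmax()

print(top_name, name_counts[top_name])
```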